ocr system
olmOCR 2: Unit Test Rewards for Document OCR
Poznanski, Jake, Soldaini, Luca, Lo, Kyle
We present olmOCR 2, the latest in our family of powerful OCR systems for converting digitized print documents, like PDFs, into clean, naturally ordered plain text. olmOCR 2 is powered by olmOCR-2-7B-1025, a specialized, 7B vision language model (VLM) trained using reinforcement learning with verifiable rewards (RLVR), where our rewards are a diverse set of binary unit tests. To scale unit test creation, we develop a pipeline for generating synthetic documents with diverse and challenging layouts, known ground-truth HTML source code, and extracted test cases. We show that RL training on these test cases results in state-of-the-art performance on olmOCR-Bench, our English-language OCR benchmark, with the largest improvements in math formula conversion, table parsing, and multi-column layouts compared to previous versions. We release our model, data and code under permissive open licenses.
- Europe > Monaco (0.04)
- Asia > Middle East > Jordan (0.04)
E-ARMOR: Edge case Assessment and Review of Multilingual Optical Character Recognition
--Optical Character Recognition (OCR) in multilingual, noisy, and diverse real-world images remains a significant challenge for optical character recognition systems. With the rise of Large Vision-Language Models (L VLMs), there is growing interest in their ability to generalize and reason beyond fixed OCR pipelines. In this work, we introduce Sprinklr-Edge-OCR, a novel OCR system built specifically optimized for edge deployment in resource-constrained environments. We present a large-scale comparative evaluation of five state-of-the-art L VLMs (InternVL, Qwen, GOT OCR, LLaMA, MiniCPM) and two traditional OCR systems (Sprinklr-Edge-OCR, SuryaOCR) on a proprietary, doubly hand annotated dataset of multilingual (54 languages) images. Our benchmark covers a broad range of metrics including accuracy, semantic consistency, language coverage, computational efficiency (latency, memory, GPU usage), and deployment cost. T o better reflect real-world applicability, we also conducted edge case deployment analysis, evaluating model performance on CPU only environments. Among the results, Qwen achieved the highest precision (0.54), while Sprinklr-Edge-OCR delivered the best overall F1 score (0.46) and outperformed others in efficiency, processing images 35 faster (0.17 seconds per image on average) and at less than 0.01 of the cost (0.006 Our findings demonstrate that the most optimal OCR systems for edge deployment are the traditional ones even in the era of LLMs due to their low compute requirements, low latency, and very high affordability. Optical Character Recognition (OCR) is a cornerstone technology for digitizing documents, automating data entry, and extracting information from images containing typed, handwritten, or printed text.
- Information Technology (0.46)
- Law (0.40)
Zero-shot OCR Accuracy of Low-Resourced Languages: A Comparative Analysis on Sinhala and Tamil
Jayatilleke, Nevidu, de Silva, Nisansa
Solving the problem of Optical Character Recognition (OCR) on printed text for Latin and its derivative scripts can now be considered settled due to the volumes of research done on English and other High-Resourced Languages (HRL). However, for Low-Resourced Languages (LRL) that use unique scripts, it remains an open problem. This study presents a comparative analysis of the zero-shot performance of six distinct OCR engines on two LRLs: Sinhala and Tamil. The selected engines include both commercial and open-source systems, aiming to evaluate the strengths of each category. The Cloud Vision API, Surya, Document AI, and Tesseract were evaluated for both Sinhala and Tamil, while Subasa OCR and EasyOCR were examined for only one language due to their limitations. The performance of these systems was rigorously analysed using five measurement techniques to assess accuracy at both the character and word levels. According to the findings, Surya delivered the best performance for Sinhala across all metrics, with a WER of 2.61%. Conversely, Document AI excelled across all metrics for Tamil, highlighted by a very low CER of 0.78%. In addition to the above analysis, we also introduce a novel synthetic Tamil OCR benchmarking dataset.
- Asia > Sri Lanka (0.05)
- Asia > Singapore (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- (11 more...)
- Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
- (2 more...)
KITAB-Bench: A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding
Heakl, Ahmed, Sohail, Abdullah, Ranjan, Mukul, Hossam, Rania, Ahmed, Ghazi, El-Geish, Mohamed, Maher, Omar, Shen, Zhiqiang, Khan, Fahad, Khan, Salman
With the growing adoption of Retrieval-Augmented Generation (RAG) in document processing, robust text recognition has become increasingly critical for knowledge extraction. While OCR (Optical Character Recognition) for English and other languages benefits from large datasets and well-established benchmarks, Arabic OCR faces unique challenges due to its cursive script, right-to-left text flow, and complex typographic and calligraphic features. We present KITAB-Bench, a comprehensive Arabic OCR benchmark that fills the gaps in current evaluation systems. Our benchmark comprises 8,809 samples across 9 major domains and 36 sub-domains, encompassing diverse document types including handwritten text, structured tables, and specialized coverage of 21 chart types for business intelligence. Our findings show that modern vision-language models (such as GPT-4, Gemini, and Qwen) outperform traditional OCR approaches (like EasyOCR, PaddleOCR, and Surya) by an average of 60% in Character Error Rate (CER). Furthermore, we highlight significant limitations of current Arabic OCR models, particularly in PDF-to-Markdown conversion, where the best model Gemini-2.0-Flash achieves only 65% accuracy. This underscores the challenges in accurately recognizing Arabic text, including issues with complex fonts, numeral recognition errors, word elongation, and table structure detection. This work establishes a rigorous evaluation framework that can drive improvements in Arabic document analysis methods and bridge the performance gap with English OCR technologies.
- North America > United States (0.04)
- Europe > Switzerland > Vaud > Lausanne (0.04)
- Europe > Switzerland > Basel-City > Basel (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Deciphering the Underserved: Benchmarking LLM OCR for Low-Resource Scripts
Sohail, Muhammad Abdullah, Masood, Salaar, Iqbal, Hamza
Synthetic datasets further enhanced these This study investigates the potential of Large Language Models models by mitigating the scarcity of annotated data for specific (LLMs), particularly GPT-4o, for Optical Character Recognition languages [3, 12] (OCR) in low-resource scripts such as Urdu, Albanian, and Tajik, Progress in low-resource OCR remains limited, with studies focusing with English serving as a benchmark. Using a meticulously curated on Indic scripts, while structurally complex scripts--such as dataset of 2,520 images incorporating controlled variations in text those with ligatures or modified Cyrillic alphabets--remain underrepresented length, font size, background color, and blur, the research simulates [12]. Similarly, few studies have explored OCR pipelines diverse real-world challenges. Results emphasize the limitations of for Urdu and Bengali [1, 2], highlighting the need for customized zero-shot LLM-based OCR, particularly for linguistically complex solutions. LLMs have recently emerged as a promising OCR solution, scripts, highlighting the need for annotated datasets and fine-tuned leveraging multimodal capabilities for dynamic adaptation to models. This work underscores the urgency of addressing accessibility diverse scripts and challenging visual conditions [10, 13]. Studies gaps in text digitization, paving the way for inclusive and have demonstrated LLMs' ability to process textual content from robust OCR solutions for underserved languages.
- Asia > Tajikistan (0.14)
- North America > United States > District of Columbia > Washington (0.06)
- Asia > Pakistan > Punjab > Lahore Division > Lahore (0.05)
- (2 more...)
RoundTripOCR: A Data Generation Technique for Enhancing Post-OCR Error Correction in Low-Resource Devanagari Languages
Kashid, Harshvivek, Bhattacharyya, Pushpak
Optical Character Recognition (OCR) technology has revolutionized the digitization of printed text, enabling efficient data extraction and analysis across various domains. Just like Machine Translation systems, OCR systems are prone to errors. In this work, we address the challenge of data generation and post-OCR error correction, specifically for low-resource languages. We propose an approach for synthetic data generation for Devanagari languages, RoundTripOCR, that tackles the scarcity of the post-OCR Error Correction datasets for low-resource languages. We release post-OCR text correction datasets for Hindi, Marathi, Bodo, Nepali, Konkani and Sanskrit. We also present a novel approach for OCR error correction by leveraging techniques from machine translation. Our method involves translating erroneous OCR output into a corrected form by treating the OCR errors as mistranslations in a parallel text corpus, employing pre-trained transformer models to learn the mapping from erroneous to correct text pairs, effectively correcting OCR errors.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
- Oceania > Australia > Victoria > Melbourne (0.04)
- (12 more...)
- Information Technology > Data Science (1.00)
- Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Advancing Post-OCR Correction: A Comparative Study of Synthetic Data
This paper explores the application of synthetic data in the post-OCR domain on multiple fronts by conducting experiments to assess the impact of data volume, augmentation, and synthetic data generation methods on model performance. Furthermore, we introduce a novel algorithm that leverages computer vision feature detection algorithms to calculate glyph similarity for constructing post-OCR synthetic data. Through experiments conducted across a variety of languages, including several low-resource ones, we demonstrate that models like ByT5 can significantly reduce Character Error Rates (CER) without the need for manually annotated data, and our proposed synthetic data generation method shows advantages over traditional methods, particularly in low-resource languages.
- South America > Argentina (0.04)
- Europe > Norway > Eastern Norway > Oslo (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.94)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Making Old Kurdish Publications Processable by Augmenting Available Optical Character Recognition Engines
Yaseen, Blnd, Hassani, Hossein
Kurdish libraries have many historical publications that were printed back in the early days when printing devices were brought to Kurdistan. Having a good Optical Character Recognition (OCR) to help process these publications and contribute to the Kurdish languages resources which is crucial as Kurdish is considered a low-resource language. Current OCR systems are unable to extract text from historical documents as they have many issues, including being damaged, very fragile, having many marks left on them, and often written in non-standard fonts and more. This is a massive obstacle in processing these documents as currently processing them requires manual typing which is very time-consuming. In this study, we adopt an open-source OCR framework by Google, Tesseract version 5.0, that has been used to extract text for various languages. Currently, there is no public dataset, and we developed our own by collecting historical documents from Zheen Center for Documentation and Research, which were printed before 1950 and resulted in a dataset of 1233 images of lines with transcription of each. Then we used the Arabic model as our base model and trained the model using the dataset. We used different methods to evaluate our model, Tesseracts built-in evaluator lstmeval indicated a Character Error Rate (CER) of 0.755%. Additionally, Ocreval demonstrated an average character accuracy of 84.02%. Finally, we developed a web application to provide an easy- to-use interface for end-users, allowing them to interact with the model by inputting an image of a page and extracting the text. Having an extensive dataset is crucial to develop OCR systems with reasonable accuracy, as currently, no public datasets are available for historical Kurdish documents; this posed a significant challenge in our work. Additionally, the unaligned spaces between characters and words proved another challenge with our work.
- Asia > Middle East > Iraq > Baghdad Governorate > Baghdad (0.04)
- Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
- Asia > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
- (17 more...)
Enhancement of Bengali OCR by Specialized Models and Advanced Techniques for Diverse Document Types
Rabby, AKM Shahariar Azad, Ali, Hasmot, Islam, Md. Majedul, Abujar, Sheikh, Rahman, Fuad
This research paper presents a unique Bengali OCR system with some capabilities. The system excels in reconstructing document layouts while preserving structure, alignment, and images. It incorporates advanced image and signature detection for accurate extraction. Specialized models for word segmentation cater to diverse document types, including computer-composed, letterpress, typewriter, and handwritten documents. The system handles static and dynamic handwritten inputs, recognizing various writing styles. Furthermore, it has the ability to recognize compound characters in Bengali. Extensive data collection efforts provide a diverse corpus, while advanced technical components optimize character and word recognition. Additional contributions include image, logo, signature and table recognition, perspective correction, layout reconstruction, and a queuing module for efficient and scalable processing. The system demonstrates outstanding performance in efficient and accurate text extraction and analysis.
- Asia > Singapore (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > District of Columbia > Washington (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.96)
- Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.74)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Advancements and Challenges in Arabic Optical Character Recognition: A Comprehensive Survey
Kasem, Mahmoud SalahEldin, Mahmoud, Mohamed, Kang, Hyun-Soo
Optical character recognition (OCR) is a vital process that involves the extraction of handwritten or printed text from scanned or printed images, converting it into a format that can be understood and processed by machines. This enables further data processing activities such as searching and editing. The automatic extraction of text through OCR plays a crucial role in digitizing documents, enhancing productivity, improving accessibility, and preserving historical records. This paper seeks to offer an exhaustive review of contemporary applications, methodologies, and challenges associated with Arabic Optical Character Recognition (OCR). A thorough analysis is conducted on prevailing techniques utilized throughout the OCR process, with a dedicated effort to discern the most efficacious approaches that demonstrate enhanced outcomes. To ensure a thorough evaluation, a meticulous keyword-search methodology is adopted, encompassing a comprehensive analysis of articles relevant to Arabic OCR, including both backward and forward citation reviews. In addition to presenting cutting-edge techniques and methods, this paper critically identifies research gaps within the realm of Arabic OCR. By highlighting these gaps, we shed light on potential areas for future exploration and development, thereby guiding researchers toward promising avenues in the field of Arabic OCR. The outcomes of this study provide valuable insights for researchers, practitioners, and stakeholders involved in Arabic OCR, ultimately fostering advancements in the field and facilitating the creation of more accurate and efficient OCR systems for the Arabic language.
- Africa > Middle East > Egypt (0.04)
- North America > United States > Nevada > Clark County > Las Vegas (0.04)
- Europe > Switzerland > Fribourg > Fribourg (0.04)
- (4 more...)
- Research Report > Promising Solution (1.00)
- Overview (1.00)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (1.00)
- Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.69)